Abstract

This paper presents a grammar and style checker demonstrator for
Spanish and Greek native writers developed within the project
GramCheck. Besides a brief grammar error typology for Spanish, a
linguistically motivated approach to detection and diagnosis is
presented, based on the generalized use of PROLOG extensions to highly
typed unification-based grammars. The demonstrator, currently
including full coverage for agreement errors and certain head-argument
relation issues, also provides correction by means of an
analysis-transfer-synthesis cycle. Finally, future extensions to the
current system are discussed.



1 Introduction 

Grammar checking arose as a logical application of earlier attempts at
natural language understanding (NLU) by computers. Many of the NLU
systems developed in the 70's included some kind of error recovery
mechanism, ranging from the treatment of spelling errors only, as in
PARRY (Parkinson et al., 197?), to the additional treatment of
incomplete input containing some kind of ellipsis, as in LADDER/LIFER
(Hendrix et al., 1977).

In the 80's, interest began to turn to considering grammar checking as
an enterprise in its own right (Carbonell & Hayes, 1982), (Hayes &
Mouradian, 1981), (Heidorn et al., 1982), (Jensen et al., 1983),
though many of the approaches were still in the NLU tradition
(Charniak, 1983), (Granger, 198?), (Kwasny & Sondheimer, 1981),
(Weischedel & Black, 198?), (Weischedel & Sondheimer, 1983). A 1985
Ovum report on natural language applications (Johnson, 198?) already
identifies grammar and style checking as one of the seven major
applications of NLP. Currently, every project in grammar checking has
as its goal the creation of a writing aid rather than a robust
man-machine interface (Adriaens, 1994), (Bolioli et al., 1992),
(Vosse, 1992).

Current systems dealing with grammatical deviance have mainly been
concerned with the integration of special techniques to detect and,
when possible, correct such deviance. In some cases, these techniques
have been incorporated into traditional parsing techniques, as is the
case with feature relaxation in the context of unification-based
formalisms (Bolioli et al., 1992), or with the addition of a set of
error-catching rules that specifically handle the deviant
constructions (Thurmair, 1990). In other cases, the relaxation
component has been included as a new add-in feature of the parsing
algorithm, as in IBM's PLNLP approach (Heidorn et al., 1982), or in
the work developed for the Translator's Workbench project using the
METAL MT system (TWB, 1992).

Besides, an increasing concern in current projects is the linguistic
relevance of the analysis performed by the grammar correction system.
In this sense, the adequate integration of error detection and
correction techniques within mainstream grammar formalisms has been
addressed by a number of these projects (Bolioli et al., 1992),
(Vosse, 1992), (Genthial et al., 1992), (Genthial et al., 1994).

Following this concern, this paper presents results from the project
GramCheck (A Grammar and Style Checker, MLAP93-11), funded by the CEC.
GramCheck has developed a grammar checker demonstrator for Spanish and
Greek native writers using ALEP (ET6/1, 1991), (Simpkins, 1994) as the
NLP development platform, a client-server architecture as implemented
in the X Windows system, Motif as the 'look and feel' interface and
Xminfo as the knowledge base storage format. Extensions to the highly
typed, unification-based formalism implemented in ALEP have been used
throughout. These extensions (called Constraint Solvers, CSs) are
simply pieces of PROLOG code performing different boolean and
relational operations over feature values. In addition, GramCheck has
used ongoing results from LS-GRAM (LRE61029), a project aiming at the
implementation of medium-coverage ALEP grammars for a number of
European languages.

The demonstrator checks whether a document contains grammar errors or
style weaknesses and, if any are found, provides users with messages,
suggestions and, for grammar errors only, automatic correction(s).

2 Brief grammar error typology 
for Spanish 

The linguistic statements made by developers of current grammar
checkers based on NLP techniques are often contradictory regarding the
types of errors that grammar checkers must correct automatically.
(Veronis, 1988) claims that native writers are unlikely to produce
errors involving morphological features, while (Vosse, 1992) accepts
such morpho-syntactic errors, in spite of the fact that an examination
of texts by the author revealed that their appearance in native
writers' texts is not frequent. Both authors agree in characterizing
morpho-syntactic errors as a sign of lack of competence.

On the other hand, an examination of real texts produced by Spanish
writers revealed that they do produce morpho-syntactic errors^1.
Spanish is an inflectional language, which increases the possibilities
for such errors. Nevertheless, other errors related to the structural
configuration of the language are produced as well.

Errors found fall into one of the following sub- 
types, assuming that featurization is the technique 
used in parsing sentences: 

1. Mismatching of features that do not af- 
fect representational issues (intra- or inter- 
syntactic agreement on gender, number, per- 
son and case for categories showing this 
phenomenon). These mismatchings produce 
non-structural errors. 

2. Mismatching of features which describe certain representational
properties of categories, such as wrong head-argument relations, word
order and substitution of certain categories. These mismatchings
produce structural errors.

Table 1 shows the percentages of the different types of errors found
in the corpus. Punctuation errors must be considered structural
violations, while for style weaknesses the classification depends on
the subtype. Errors at the lexical level are difficult to classify,
and most of them must be regarded as spelling rather than grammar
errors. The number of errors identified in this corpus is 543. These
statistics could be regarded as a representative average of the
frequency of errors/mistakes occurring in Spanish texts.

Table 1: Statistics of errors

Type of error                                              Percentage
---------------------------------------------------------------------
Non-structural violations as described above                     18.5
Structural violations as described above                          9.7
Punctuation                       Omission                       32.2
                                  Addition                        4.8
Errors at the lexical level^2     At the character level          6.3
                                  Stress                          8.0
Style weaknesses                  Structural                      3.5
                                  Lexical                        12.0
Others                                                            5.0


^1 The corpus used contains nearly 70,000 words, including text
fragments from literature, newspapers, and technical and
administrative documentation. It has been provided to a large extent
by the GramCheck pilot user, ANAYA, S.A.

^2 Errors at the lexical level include typing errors, word
segmentation (si no vs. sino), and cognitive errors (onceavo
(partitive) vs. undécimo (ordinal)).

A presupposition adopted in the project led to 
the idea that violations at the feature level can be 
captured by means of the relaxation of the possi- 
bly violated features while violations at the level of 
configuration may not be relaxed without raising 
unpredictable parsing results, thus being candi- 
dates for the implementation of explicit rules en- 
coding such incorrect structures. 

Under this view, a comprehensive grammar 
checker must make use of both strategies, called in 
the literature feature or constraint relaxation 
and error anticipation, respectively. 
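
The contrast between ordinary unification and feature relaxation can be sketched as follows. This is only an illustration in Python, not the ALEP formalism: the flat feature dictionaries and the names `unify_strict` and `unify_relaxed` are assumptions made for the example. Error anticipation, which relies on explicit rules for ill-formed structures, is not shown.

```python
from typing import Optional

# Assumption: a sign is reduced to a flat mapping from feature name to value.
Features = dict[str, str]


def unify_strict(a: Features, b: Features) -> Optional[Features]:
    """Plain unification: fail (None) as soon as two values clash."""
    merged = dict(a)
    for name, value in b.items():
        if merged.get(name, value) != value:
            return None
        merged[name] = value
    return merged


def unify_relaxed(
    a: Features, b: Features, relaxable: set[str]
) -> Optional[tuple[Features, list[str]]]:
    """Unify, but record clashes on relaxable features instead of failing.

    On a clash the value from `a` is kept; choosing the right value is
    left to a later diagnosis step.
    """
    merged = dict(a)
    clashes: list[str] = []
    for name, value in b.items():
        if merged.get(name, value) != value:
            if name not in relaxable:
                return None  # a non-relaxable clash is still a failure
            clashes.append(name)
        else:
            merged[name] = value
    return merged, clashes


determiner = {"gender": "masc", "number": "sg"}  # "el"
noun = {"gender": "fem", "number": "sg"}         # "casa"

strict_result = unify_strict(determiner, noun)
relaxed_result = unify_relaxed(determiner, noun, {"gender", "number"})
```

With strict unification the parse of *el casa* simply fails, whereas the relaxed variant lets the analysis continue and reports that `gender` is the offending feature.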

However, given the relevance of features in the encoding of linguistic
information in TFSs, some structural errors can be reanalyzed as
agreement errors in a wide sense (as feature mismatching violations
rather than structural ones). This allows the implementation of a
uniform approach to grammar correction, thus avoiding explicit rules
for ill-formed input. This paper describes such an implementation for
both non-structural and structural violations.

3 Error detection, diagnosis and 
correction techniques 

The overall strategy for detection, diagnosis and correction of
grammar and style errors within GramCheck relies on three axes:

• For detection, a combined feature relaxation and error anticipation
approach is adopted. To implement the former, extensive use is made of
external CSs in the analysis grammar, whereas for the latter, explicit
rules, adequately defined either in the core grammar or in satellite
subgrammars, are implemented^3.

• Diagnosis is performed by the CSs themselves with the aid of two
techniques: a heuristic technique for those errors where tests should
be performed on several elements, and a pattern-related technique
which provides a means to extend feature values with a gradation of
correct and possible but incorrect information. The typical case for
the former is agreement; thus, for signs involving this type of
information, an initial heuristic value is assigned and arithmetical
operations are performed on (in)equality tests. As for the latter,
head-argument relations where bound prepositions are involved are
treated this way. For grammar errors there is no notion of weak vs.
strong diagnosis: all are considered strong errors needing automatic
correction.

• Correction is performed by means of tree transduction of Linguistic
Structures (LSs) containing errors to a 'language' (actually a
'language use') defined as correct Spanish. These synthesized
structures are displayed to the user. The overall design is thus
similar to a transfer-based MT system, where the usual cycle is
analysis-transfer-synthesis, the main differences being the addition
of the above-mentioned grammar correction devices and the fact that
not all sentences, but only incorrect ones, are pushed through the
complete cycle.
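
The control flow of this cycle can be sketched as below. This is a toy Python illustration under strong assumptions: the three stage functions, the tiny lexicon and the `Analysis` record are hypothetical stand-ins for the ALEP analysis, transfer and synthesis components. It only shows that correct input stops after analysis while erroneous input goes through the full cycle.

```python
from dataclasses import dataclass, field

# Toy lexicon (hypothetical): inherent gender of a few nouns and articles.
NOUN_GENDER = {"casa": "fem", "libro": "masc"}
ARTICLE = {"masc": "el", "fem": "la"}
ARTICLE_GENDER = {"el": "masc", "la": "fem"}


@dataclass
class Analysis:
    tokens: list[str]
    # token index -> gender the token should have; empty if no error found
    diagnosis: dict[int, str] = field(default_factory=dict)


def analyse(sentence: str) -> Analysis:
    """Toy analysis: flag an article whose gender clashes with the next noun."""
    tokens = sentence.split()
    result = Analysis(tokens)
    for i in range(len(tokens) - 1):
        art, noun = tokens[i], tokens[i + 1]
        if art in ARTICLE_GENDER and noun in NOUN_GENDER:
            if ARTICLE_GENDER[art] != NOUN_GENDER[noun]:
                result.diagnosis[i] = NOUN_GENDER[noun]
    return result


def transfer(analysis: Analysis) -> list[str]:
    """Map the erroneous structure onto its corrected counterpart."""
    corrected = list(analysis.tokens)
    for index, gender in analysis.diagnosis.items():
        corrected[index] = ARTICLE[gender]
    return corrected


def synthesise(tokens: list[str]) -> str:
    """Generate the surface string from the corrected structure."""
    return " ".join(tokens)


def check(sentence: str):
    """Return a corrected sentence, or None if analysis found no error."""
    analysis = analyse(sentence)
    if not analysis.diagnosis:
        return None  # correct input stops after analysis
    return synthesise(transfer(analysis))
```

Here `check("el casa")` goes through all three stages, while `check("la casa")` returns after analysis alone.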

CSs allow the relaxation of certain features in the grammar rules
whose unification will be decided upon, in a non-trivial way, within
these CSs. Thus, rules do not perform feature value checking, so CSs
play a crucial role, performing extended variable unification and
taking appropriate decisions. Depending on the error type, CSs carry
out different operations on features, scores and lists. These
operations basically concern the detection and the evaluation of the
error, providing a diagnosis of the error and the correct value(s) for
the features involved. The use of CSs favours a one-step diagnosis
procedure where decisions are only taken once all the relevant
information is gathered.

^3 GramCheck checks texts belonging to the standard language and to
the administrative sublanguage. The analysis module has been conceived
as composed of a core grammar and two satellite subgrammars (for
overlapping cases) that are mutually exclusive. Thus, the activation
of one subgrammar implies the deactivation of the other, and in both
cases they are added to the core grammar depending on the type of
(sub)language selected by the user.

3.1 A heuristic technique to diagnose 
non-structural errors 

CSs are used in GramCheck roughly in the way presented in (Crouch,
1994) and (Ruessink, 1994), developers of CSs for ALEP. Nevertheless,
while these reports describe constraint solving as an extension to the
expressive power of grammar rules, the novelty of the approach
presented here is the use of CSs to allow feature relaxation in rules
and boolean, relational and arithmetical operations on relevant
features.

Agreement errors pose a problem for a grammar checker which parses
natural language, because the detection of the error and the diagnosis
procedure have to be performed automatically. In inflectional
languages like Spanish, this issue is essential, given that in certain
contexts it is not possible to give a single correction when
performing analysis only at sentence level (i.e. without anaphoric
relations). For these cases, the system should be provided with a
heuristic for correction in order to detect and diagnose the place(s)
where the error(s) has (have) been produced and to decide which
unit(s) to correct. For GramCheck, this heuristic relies on a
parametrization of two assumptions:

1. the constituent which holds the feature values 
that in a given error situation control the rest 
of the feature values in the other constituents, 

2. the evaluation of the number of constituents 
which share and do not share the same values. 

Our diagnosis procedure assumes that the gender and number features in
the head of a phrase control those in the dependent constituent(s),
although, as will be shown later, this is not necessarily true. To
carry out this diagnosis procedure, the CS contrasts those features
and leaves some clues of this evaluation in phrasal projections, so
that they are available for further operations should these be
necessary. These clues take the form of scores in the approach adopted
for agreement errors and, in this sense, our heuristic is close to the
metric operations performed by other grammar checkers based on NLP
techniques (Veronis, 1988), (Bolioli et al., 1992), (Vosse, 1992),
(Genthial et al., 199?).



The core of this heuristic is that, depending on a set of linguistic
principles based on lexico-morphological properties, the values for
gender and number in certain lexical units will be promoted over the
values in other units, thus assigning them a higher score.

Several conditions have to be taken into account in order to perform
the diagnosis procedure. For instance, nouns with inherent gender
should control the gender of the rest of the elements in a given NP.
However, if the noun does not have inherent gender (i.e. it is a noun
that shows sex inflection), then the gender value should be controlled
by those elements that, sharing the same value, are in the majority.
Hence, a sequence like el_masc casa_fem (the house) must be corrected
into la_fem casa_fem because this noun has inherent feminine gender in
Spanish. On the other hand, an NP like la_fem chic-o_masc guap-a_fem
(lit. 'the boy beautiful') should be corrected as la_fem chic-a_fem
guap-a_fem ('the girl beautiful'), thus changing the gender value of
the head noun in the direction suggested by the other dependent
elements. This means that although the system could take the gender
value of the head as the value which commands the whole phrase, the
number of elements that share the same feature values can influence
the final decision, provided that these values contrast with those of
the head and that the head takes its agreement properties from
morphology (i.e. is susceptible to keystroke errors). Finally, for
cases where equal scores are obtained, as happens with a non-inherent
masculine noun and a feminine determiner, both possible corrections
should be performed, since there is not enough information to decide
the correct value (unless this can be obtained from other agreeing
elements in the sentence, for instance an attribute of this NP).

Basically, the final operation performed with the scores rests on the
principle that the higher the score of an element, the harder it is to
justify its substitution. Thus, scores serve as clues for the
correction of those elements having the lowest scores.

The initialization steps of the heuristic technique concern the
assignment of values and scores to lexical projections depending on
their inherentness. The values for gender and number of the head of
the projection serve as a parameter for the computation of values and
scores for any modifier which could appear close to it. Note that
agreement in Spanish is based on a binary value system. Thus, the
computation of values for the modifier of a given head simply relies
on the instantiation of the opposite values to those of the head. In
the case of underspecification of the head for gender, for instance,
the presupposition is that this value is the same as that of the
modifier, if the latter is not underspecified. Otherwise, both
elements remain underspecified. Besides, the weight given to
controlling elements (50) ensures that there is no way for modifiers
to exceed this score. Note as well that the weight given to
non-inherent values, such as number (10), ensures that there are no
promoted elements in this calculation. The following schematic CS
illustrates the assignment of scores:



    % number is never inherent: the head always gets weight 10
    and(=(Score_number_head, 10),
        and(
            % inherent gender controls (50); otherwise weight 10
            or(and(=(Inherentness_head, yes),
                   =(Score_gender_head, 50)),
               and(=(Inherentness_head, no),
                   =(Score_gender_head, 10))),
            =(Score_number_mod, 0)))
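
The same initialization can be restated in Python as a minimal sketch. The weights 50 and 10 and the zero number score for the opposite value come from the text; the `HeadScores` record, the function name, and the zero initial gender score for the opposite value are assumptions made for illustration.

```python
from dataclasses import dataclass

INHERENT_WEIGHT = 50       # weight of a controlling (inherent) value
MORPHOLOGICAL_WEIGHT = 10  # weight of a value that comes from inflection


@dataclass
class HeadScores:
    gender_head: int  # score backing the head's own gender value
    gender_mod: int   # score backing the opposite gender value
    number_head: int  # score backing the head's own number value
    number_mod: int   # score backing the opposite number value


def initial_scores(inherent_gender: bool) -> HeadScores:
    """Assign the starting scores of a nominal head."""
    gender = INHERENT_WEIGHT if inherent_gender else MORPHOLOGICAL_WEIGHT
    return HeadScores(
        gender_head=gender,
        gender_mod=0,  # assumption: nothing backs the opposite value yet
        number_head=MORPHOLOGICAL_WEIGHT,
        number_mod=0,
    )


casa = initial_scores(inherent_gender=True)    # noun with inherent gender
chico = initial_scores(inherent_gender=False)  # noun with sex inflection
```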



The following steps to be performed by CSs are 
related to the addition of all those scores associ- 
ated to a given value in the successive rules build- 
ing the nominal projection and the percolation of 
morphosyntactic features: 

    or(
        % modifier agrees with the head (or is underspecified)
        and(or(=(Gender_head_mother, Gender_mod),
               =(Gender_mod, masc_fem)),
            and(num_add(MGEN_SCORE_HEAD,
                        MGEN_SCORE_MOD,
                        MGEN_SCORE_MOTHER),
                and(=(Gender_mod_head, Gender_mod_mother),
                    num_add(HGEN_SCORE_HEAD,
                            HGEN_SCORE_MOD,
                            HGEN_SCORE_MOTHER)))),
        % modifier carries the value opposite to that of the head
        and(=(Gender_mod, Gender_mod_head),
            and(num_add(HGEN_SCORE_HEAD,
                        MGEN_SCORE_MOD,
                        HGEN_SCORE_MOTHER),
                and(=(Gender_mod_head, Gender_mod_mother),
                    num_add(MGEN_SCORE_HEAD,
                            HGEN_SCORE_MOD,
                            MGEN_SCORE_MOTHER)))))
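
The effect of this accumulation on the examples discussed earlier can be sketched in Python. The sketch is a simplification: each modifier contributes a single score to one of two running totals, whereas the CS adds both score slots of both daughters. The modifier weight of 10 and the function names are assumptions.

```python
UNDERSPECIFIED = "masc_fem"  # value of a modifier unmarked for gender


def add_modifier(head_gender, h_score, m_score, mod_gender, mod_score=10):
    """Add one modifier's score to the mother's totals.

    h_score backs the head's gender, m_score backs the opposite value.
    """
    if mod_gender in (head_gender, UNDERSPECIFIED):
        return h_score + mod_score, m_score
    return h_score, m_score + mod_score


def score_phrase(head_gender, head_score, modifier_genders):
    """Percolate scores through the successive rules building the NP."""
    h_score, m_score = head_score, 0
    for gender in modifier_genders:
        h_score, m_score = add_modifier(head_gender, h_score, m_score, gender)
    return h_score, m_score


# el_masc casa_fem: inherent feminine head (50) against one masculine modifier
el_casa = score_phrase("fem", 50, ["masc"])
# la_fem chic-o_masc guap-a_fem: non-inherent head (10), two feminine modifiers
la_chico_guapa = score_phrase("masc", 10, ["fem", "fem"])
# la_fem chic-o_masc: non-inherent head (10) against one feminine modifier
la_chico = score_phrase("masc", 10, ["fem"])
```

Under these assumptions the head wins in *el casa*, the modifiers win in *la chico guapa*, and *la chico* ends in a tie.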

The final evaluation performed by CSs takes place when categories
showing agreement go beyond their maximal projection, and only if no
other inter-syntagmatic agreement must be taken into account (as is
the case with subject-attribute agreement, for instance). Postponing
the final evaluation in this way ensures that the CS takes into
account all the previous parameters to give an appropriate diagnosis
of the complete XP containing the agreement violation. This evaluation
is based on the comparison of scores by means of the 'greater than'
predicate in order to determine (a) the correct value for the
feature(s) checked, corresponding to the highest score(s)
(Right_Gender, Right_Number in the example below), to be used by the
transfer module, and (b) the error diagnosis (gender, number and
gender_number below), to be used by the error handling module that
will display appropriate error information to the user:



    and(
        % gender: the higher score decides; equal scores leave it open
        or(
            or(and(num_gt(HGEN_SCORE_NOUN, MGEN_SCORE_NOUN),
                   =(Gender_Noun, Right_Gender)),
               and(num_gt(MGEN_SCORE_NOUN, HGEN_SCORE_NOUN),
                   =(Gender_Mod, Right_Gender))),
            =(HGEN_SCORE_NOUN, MGEN_SCORE_NOUN)),
        % number: same comparison on the number scores
        or(
            or(and(num_gt(HNUM_SCORE_NOUN, MNUM_SCORE_NOUN),
                   =(Number_Noun, Right_Number)),
               and(num_gt(MNUM_SCORE_NOUN, HNUM_SCORE_NOUN),
                   =(Number_Mod, Right_Number))),
            =(HNUM_SCORE_NOUN, MNUM_SCORE_NOUN))),

If all elements agree, the score for one of the arguments will always
be 0; if this argument has a value different from 0, this is taken as
evidence that an error has occurred, and the subsequent comparison
determines the value for the winning score:

    or(
        % no number clash: either no error at all or a gender error
        and(=(MNUM_SCORE_MOTHER, 0),
            or(=(MGEN_SCORE_MOTHER, 0),
               and(num_gt(MGEN_SCORE_MOTHER, 0),
                   =(ERTYPE, gender)))),
        % number clash, possibly combined with a gender clash
        and(num_gt(MNUM_SCORE_MOTHER, 0),
            or(and(=(MGEN_SCORE_MOTHER, 0),
                   =(ERTYPE, number)),
               and(num_gt(MGEN_SCORE_MOTHER, 0),
                   =(ERTYPE, gender_number)))))
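
Both evaluation steps can be mirrored in a short Python sketch. The function names and the `OPPOSITE` table are assumptions; the decision logic follows the two schematic CSs above, with a tie returning both candidate values so that both corrections can be proposed.

```python
# Binary value system: each value has exactly one opposite.
OPPOSITE = {"masc": "fem", "fem": "masc", "sg": "pl", "pl": "sg"}


def right_values(head_value: str, h_score: int, m_score: int) -> list[str]:
    """Pick the value(s) backed by the highest score.

    h_score backs the head's value, m_score backs the opposite one.
    """
    if h_score > m_score:
        return [head_value]
    if m_score > h_score:
        return [OPPOSITE[head_value]]
    return [head_value, OPPOSITE[head_value]]  # tie: keep both candidates


def error_type(m_gen_score: int, m_num_score: int):
    """Derive the diagnosis from the scores backing the opposite values.

    A score of 0 means every element agreed on that feature.
    """
    if m_num_score == 0:
        return "gender" if m_gen_score > 0 else None
    return "gender_number" if m_gen_score > 0 else "number"


# el_masc casa_fem: head scores (50, 10), number agrees
el_casa_gender = right_values("fem", 50, 10)
el_casa_error = error_type(m_gen_score=10, m_num_score=0)
# la_fem chic-o_masc: tie (10, 10), so both genders remain possible
la_chico_gender = right_values("masc", 10, 10)
```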

3.2 A pattern-related technique to 
perform structural error 
detection / diagnosis 

Turning back to the general definitions on error 
types given at the beginning of this document, 
structural violations can be seen as special cases 
of feature mismatching produced by addition, sub- 
stitution and omission of elements which result in 
a wrong dependency relation: 

Wrong head-argument relations

(i) Substitution of a bound preposition by another one (PP ↦ PP)

Los alumnos relacionan la tarea [*a/con] su conocimiento.

(ii) Omission of a bound preposition resulting in a change of the
subcategorized argument (PP ↦ NP/S)

Se acordó [*∅/de] que tenía una reunión por la mañana.

(iii) Addition of a preposition resulting in a change of the
subcategorized argument (NP/S ↦ PP)

Las empresas demandan [*de] métodos.

In the HPSG-like grammar used, bound prepositions are considered NPs
attached to the subcat list (i.e. the subcategorization feature) of a
predicative unit. These NPs have the feature pform instantiated to the
value of the preposition, if any. If the argument does not have a
bound preposition, the value for pform is none. Thus, the approach
adopted within GramCheck is that these error cases have a correct
representation of the dependency structure where the only offending
information is stored as a feature in the governed element.

The linguistic principle behind the pattern-related technique is based
on the fact that native writers substitute one preposition for another
when they associate patterns that show the same lexico-semantic and/or
syntactic properties. Thus, this kind of error is not as accidental as
might be imagined.

For instance, Spanish speakers/writers usually associate the argument
structure of the comparative adjective inferior (lower), which
subcategorizes for the preposition a (to), with the Spanish
comparative syntactic pattern (menos ... que, less ... than) whose
second term is introduced by the conjunction que, producing phrases
such as *inferior que instead of inferior a. With the verb relacionar
(to relate), something similar occurs: this verb subcategorizes for
the preposition con; however, since there exist the prepositional
multi-word units en relación a and en relación con, speakers tend to
think that the same prepositional alternation can be performed with
the verb (*relacionar a vs. relacionar con).

Following this idea, configurational rules are regarded, for grammar
checking, as descriptions of patterns, each of them having an
associated wrong pattern linked to the correct one. Both patterns are
in complementary distribution. This way, structural errors can be
foreseen and controlled, and the system is provided with a mechanism
which establishes the way rule constraints must be relaxed.

To cope with this error, a CS operating on lists checks whether the
preposition in the constituent attached to the predicative sign
belongs to the head of the list or to the tail. If the preposition is
a member of the tail, the same actions shown for agreement errors are
performed: instantiation of the correct value and determination of the
error type.
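
A list-based check of this kind can be sketched in Python as follows. Several things here are assumptions rather than GramCheck's actual encoding: that the head of the list holds the correct pform and the tail the anticipated wrong ones, the toy lexicon built from the examples in this section, and the error labels returned.

```python
from typing import Optional

NONE = "none"  # pform value when the argument has no bound preposition

# Hypothetical lexicon: first item = correct pform, rest = anticipated
# wrong values linked to it.
PFORM_PATTERNS = {
    "relacionar": ["con", "a"],
    "inferior": ["a", "que"],
    "acordarse": ["de", NONE],
    "demandar": [NONE, "de"],
}


def check_pform(predicate: str, found: str) -> Optional[dict]:
    """Diagnose the pform found on an argument of `predicate`.

    Returns None when the value is neither correct nor anticipated,
    i.e. when this technique has nothing to say about it.
    """
    correct, *anticipated = PFORM_PATTERNS[predicate]
    if found == correct:
        return {"error": None, "right_pform": correct}
    if found in anticipated:
        if found == NONE:
            kind = "omission"
        elif correct == NONE:
            kind = "addition"
        else:
            kind = "substitution"
        return {"error": kind, "right_pform": correct}
    return None


substitution = check_pform("relacionar", "a")  # *relacionar a
omission = check_pform("acordarse", NONE)      # *se acordó que ...
addition = check_pform("demandar", "de")       # *demandan de métodos
```

Keeping the wrong values next to the correct one means the relaxation is limited to errors that have been foreseen; any other preposition is not accepted by this check.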

4 Error coverage 

The current version of the GramCheck demonstra- 
tor is able to deal with the following types of er- 
rors: 

• Intra- and inter-syntagmatic agreement errors (gender and/or number
in active sentences, with both predicative and copulative verbs, and
in passive sentences).

• Direct objects: omission of the preposition a with an animate entity
and addition of such a preposition with a non-animate entity.

• Addition, omission and substitution of a bound preposition, covering
what are called dequeísmo (the addition of a false bound preposition
de with clausal arguments) and queísmo (the omission of the bound
preposition de with clausal arguments).

• Errors on portmanteau words (use of de el, a el instead of del, al).

Regarding style issues, three different types of weaknesses are
detected: structural weaknesses, lexical weaknesses and abusive use of
passives, gerunds and manner adverbs. Structural weaknesses (noun +
"a" + infinitive) are detected in the phrase structure rules using
CSs, by means of an error anticipation strategy, whereas lexical
weaknesses are detected at the lexical level, with no special
mechanisms other than simple CSs. Lexical errors currently detected
are related to the use of Latin words which are better avoided,
foreign words with Spanish derivation, cognitive errors, foreign words
for which a Spanish word is recommended, and verbosity.

5 Further developments 

Results obtained with the current demonstrator are very promising. The
performance of the system using CSs is similar to that shown without
them; hence their use in conjunction with the detection techniques
proposed, rather than being a burden, may be seen as a means of adding
robustness to NLP systems. In fact, CSs may provide more natural
solutions to grammar implementation issues, like PP-attachment
control.

Several directions for further developments have already been defined.
These include the integration of these grammar checking techniques
into the final release of the LS-GRAM Spanish grammar, which will have
a more realistic coverage in terms both of linguistic phenomena and
lexicon. Besides, in this new version of the grammar, hybrid
techniques will be used, taking advantage of the preprocessing
facilities included in ALEP. In particular, while for errors like
those presented in this paper the approach adopted is linguistically
motivated, for certain punctuation errors (or simply in order to
reduce lexical ambiguity) other relatively simple means can be
defined, including certain extended pattern matching on regular
expressions or the passing of linguistic information gathered in a
preprocessing phase to the unification-based parser. It is also
foreseen to include a treatment for cognitive spelling errors, usually
not dealt with by conventional spelling checkers.